Generalising uncertainty improves accuracy and safety of deep learning analytics applied to oncology

Uncertainty estimation is crucial for understanding the reliability of deep learning (DL) predictions, and critical for deploying DL in the clinic. Differences between training and production datasets can lead to incorrect predictions with underestimated uncertainty. To investigate this pitfall, we benchmarked one pointwise and three approximate Bayesian DL models for predicting cancer of unknown primary, using three RNA-seq datasets with 10,968 samples across 57 cancer types. Our results highlight that simple and scalable Bayesian DL significantly improves the generalisation of uncertainty estimation. Moreover, we designed a prototypical metric—the area between development and production curve (ADP), which evaluates the accuracy loss when deploying models from development to production. Using ADP, we demonstrate that Bayesian DL improves accuracy under data distributional shifts when utilising ‘uncertainty thresholding’. In summary, Bayesian DL is a promising approach for generalising uncertainty, improving performance, transparency, and safety of DL models for deployment in the real world.

Recent advances in deep learning (DL) have led to the rapid development of diagnostic and treatment support applications in various aspects of healthcare, including oncology 1-4. The proposed applications of DL utilise a range of data modalities, including MRI scans 5, CT scans 6, histopathology slides 7, genomics 8, transcriptomics 9,10, and most recently, integrated approaches with various data types 11,12. In general, studies using DL show excellent predictive performance, providing hope for successful translation into clinical practice 13,14. However, prediction accuracy in DL comes with potential pitfalls that need to be overcome before wider adoption can occur 15.
The lack of transparency over prediction reliability is one challenge for implementing DL 16. One approach to overcome this is to provide uncertainty estimates about a model's prediction 17,18, enabling better-informed decision making. Another obstacle relates to the assumptions made about data when transitioning from training to real-world applications. In standard DL practice, during the 'development' stage, models are trained and validated on data prepared to satisfy the assumption of independent and identically distributed (IID) data, meaning that the model is expected to make predictions on data that are independently drawn from the same distribution as the training data. However, this assumption cannot be guaranteed and is, in fact, frequently violated when models are deployed in 'production' (i.e. real-world application). This is because confounding variables, which we cannot control for, cause distributional shifts that push data out-of-distribution (OOD) 19. For oncology applications, confounding variables can include technical differences in how the data are collected (e.g., batch effects, differences in sequencing depth or library choice for genomic and transcriptomic data; differences in instrumentation and imaging settings for medical imaging data), as well as biological differences (e.g., differences in patient demographics or a data class unseen during model development). The consequences from

Results
Bayesian model benchmarking approach to predict cancer of unknown primary. The primary DL task was to predict the tissue of origin (primary cancer type) of cancer samples using transcriptomic data. We used transcriptomic data from TCGA of primary cancer samples corresponding to 32 primary cancer types as model 'development' data: training (n = 8202 31 ) and validation IID data (n = 1434; Supplementary Table S1). The test data were OOD (representing 'production'), providing a platform for benchmarking resilience to overconfidence, and included TCGA metastatic samples (n = 392 32 ), Met500 metastatic samples (n = 479 33 ), and a combination of primary and metastatic samples from our own independent internal custom dataset, i.e. ICD (n = 461 [34][35][36][37][38][39][40][41][42] ; Fig. 1a, Supplementary Fig. S1). The distributional shifts in the test data were likely to be caused by several factors, including dataset batches, sample metastasis status (metastatic or primary) and whether the cancer type was absent during training ('unseen').
We aimed to evaluate if three simple 'distribution-wise' Bayesian DL models improve performance and reduce shift-induced overconfidence compared to a pointwise baseline model (with identical Resnet architecture). To achieve this, we performed controlled benchmarking of the models over IID and OOD data (Fig. 1b). The experiment was controlled by enforcing consistency for factors affecting uncertainty within the validation/IID dataset. Specifically, all models had identical architecture, hyperparameter, and optimisation settings. Importantly, all models had identical (negative log likelihood) loss within the validation/IID dataset. We intentionally did not perform hyperparameter optimisation for each model, as it was important for our study design to control for accuracy.
The Bayesian models were Monte Carlo Dropout approximation ('MCD') 43 , MCD with smoothness and sensitivity constraints ('Bilipschitz') 44,45 , and an ensemble of Bilipschitz models ('Ensemble') 45 . The ways in which models differed were canonical: MCD modified Resnet by keeping Dropout during prediction, Bilipschitz modified MCD with spectral normalisation, Ensemble modified Bilipschitz by combining multiple models.
Approximate Bayesian inference reduces shift-induced overconfidence for 'seen' classes in a primary cancer site context. The predictive performance of each model to predict primary tissue was assessed using micro-F1 (equivalent to Accuracy; abbreviated F1). For the IID validation data, the difference between the highest and lowest ranking models was 0.28% (97.07% for Resnet and 96.79% for Ensemble, respectively; Fig. 2a, Supplementary Figs. S2-S5). This was anticipated, since the loss was controlled for within validation data. As expected, F1 scores dropped for the OOD test set across all four models, with a 1.74% difference between the highest and lowest ranking models (82.04% for Ensemble and 80.30% for Resnet, respectively; Fig. 2b, Supplementary Figs. S6-S9). All models had higher predictive uncertainties (Shannon's entropy, H) for OOD, relative to IID data (Fig. 2b). Uncertainties were significantly higher for all approximate Bayesian models (MCD, Bilipschitz, and Ensemble) relative to (pointwise) Resnet (p < 0.0001). Moreover, overconfidence in OOD data was evident for the Resnet and MCD models since their binned accuracies (i.e., the correct classification rates within bins delineated by the confidence scores) were consistently lower than corresponding confidence scores (Fig. 2c). The expected calibration errors (ECEs) for OOD data ranged between 5% for Ensemble and Bilipschitz and 16% for Resnet (Fig. 2c). Estimation of overconfidence as an absolute error was negligible across all models for IID data, with high amounts of overconfidence for OOD data, highlighting the shift-induced overconfidence when transitioning from IID to OOD data (Fig. 2d). Furthermore, Resnet had significantly higher overconfidence than MCD (p value < 0.01), Bilipschitz (p value < 0.001), and Ensemble (p value < 0.001) for OOD data but not IID data. This shows that the shift-induced overconfidence in pointwise DL models can be reduced with simple (approximate) Bayesian inference.
Figure 1. Overview of the study design. (a) Simplified study workflow. TCGA primary cancer types comprised the training and IID validation data. OOD test data comprised the TCGA (metastatic cancer types), Met500 and ICD datasets, which included primary, metastatic and 'unseen' cancer types. (b) Schematic overview of the four tested models: pointwise Resnet (Resnet), Resnet extended with Monte Carlo Dropout (MCD), MCD extended with bi-Lipschitz constraint (Bilipschitz), and an ensemble of Bilipschitz models (Ensemble). Note, Resnet represents a single point in function space (blue dot), while two Bayesian models (MCD and Bilipschitz) represent a distribution within a single region in function space (green dots). The Ensemble represents a collection of distributions centred around different modes (red dots).
Prediction overconfidence for 'unseen' classes explained by related primary cancer types. Classes absent from training ('unseen') cannot have correct predictions, and prediction uncertainties should be higher compared to 'seen' classes. As expected, mean total uncertainties were higher for 'unseen' classes for all models (Fig. 3a). Moreover, approximate Bayesian models were significantly more uncertain with 'unseen' classes compared to Resnet (p value < 0.01; Fig. 3a). However, exceptions occurred across all models, where total uncertainty values were low, at both: class level, where predictions for a whole 'unseen' class consistently had low uncertainty; and sample level, where predictions for only some samples from a class had low uncertainty (Fig. 3b). We wanted to investigate whether any of the exceptions could be examples of 'silent catastrophic failure' (Supplementary Information S4.2), a phenomenon where data are far from the training data's support, resulting in incorrect yet extremely confident predictions 44-46. 'Unseen' classes (i.e., cancer types) with low levels of uncertainty (averaged within the class) corresponded to 'seen' classes that were either (biologically) related to the predicted primary cancer type, or were from a similar tissue or cell of origin. For example, all acral melanoma (ACRM) samples (n = 40), a subtype of melanoma that occurs on soles, palms and nail beds, were predicted to be cutaneous melanoma (MEL) by all four models (Supplementary Figs. S6-S9) with the smallest median total uncertainty for all four models (Fig. 3b). All three fibrolamellar carcinoma (FLC) samples, a rare type of liver cancer, were predicted to be hepatocellular carcinomas (HCC), although the median uncertainty was much higher for Bilipschitz and Ensemble models compared to Resnet and MCD (1.8, 1.5, 0.1 and 0.29 Shannon's entropy H, respectively).
Two bladder squamous cell carcinomas (BLSC) showed different examples of class-level exceptions, with one sample predicted as a bladder adenocarcinoma (BLCA), which has the same primary tissue site as BLSC, or a lung squamous carcinoma (LUSC), which has a similar cell of origin. For the 'unseen' class pancreatic neuroendocrine tumours (PANET) we saw a wide spread of uncertainty values (Fig. 3b). Interestingly, only PANET samples that were predicted as another subtype of pancreatic cancer, pancreatic adenocarcinomas (PAAD), had low prediction uncertainty across all models compared to other incorrectly predicted PANET samples (Supplementary Fig. S10). Overall, since most of the incorrect predictions with low uncertainties had a reasonable biological explanation for the prediction, we concluded that we did not find strong evidence of silent catastrophic failure in this case study.

Robustness to shift-induced overconfidence is integral for production inference.
To evaluate the robustness of the models' accuracy, as well as the uncertainty's correlation with the error-rate (abbreviated "uncertainty's error-rate correlation"), we used the F1-Retention Area Under the Curve (F1-AUC) 47. Evaluation was carried out on 'seen' and 'unseen' OOD data (i.e., 'production data'). All models yielded similar results, with only a 0.45% decrease between the highest and lowest ranking models (F1-AUC of 93.67% for Bilipschitz and 93.25% for MCD, respectively; Fig. 4a). The performance difference between all models was marginal as F1-AUC does not capture the lost calibration caused by the distributional shift when transitioning from IID to ('seen' and 'unseen') OOD. In other words, the F1-AUC metric did not detect effects caused by the shift-induced overconfidence. This was evident from the following observations: (1) inter-model accuracies were similar within IID, as well as OOD data (Fig. 2a); (2) calibration errors (i.e. overconfidence) were not different for IID (p value > 0.05), but different for OOD (p value < 0.01; Fig. 2d); and (3) F1-AUC scores were similar for all models, which implies 'uncertainty's error-rate correlation' must have been similar (since F1-AUC encapsulates accuracy and 'uncertainty's error-rate correlation' 47). Thus, while we showed that F1-AUC encapsulated accuracy and 'uncertainty's error-rate correlation', both of which are important components of robustness when deploying DL in production, we caution that F1-AUC does not encapsulate robustness to shift-induced overconfidence. Hence it is not sufficient for safe deployment in clinical practice.
To overcome the limitation of the F1-AUC metric's insensitivity to shift-induced overconfidence, we developed a new (prototypical) metric called the Area between the Development and Production curve (ADP), which depends on both IID (i.e., 'development') data, as well as the ('seen' and 'unseen') OOD (i.e., 'production') data.
The ADP may be interpreted as "the expected decrease in accuracy when transitioning from development to production if uncertainty thresholding is utilised to boost reliability". The ADP differs from ECE and Accuracy in two primary ways. First, ECE and accuracy relate to a single data set, whereas the ADP relates to two data sets, hence ADP explains the expected change in, for example, accuracy from one data set relative to the other. Second, the ADP complements and subsumes F1-AUC in the context of deploying models from training/development data (IID) to production test data (OOD). The ADP was calculated by averaging the set of decreases in F1, from development (IID) to production (OOD) datasets, at multiple different uncertainty thresholds (a single F1-decrease is demonstrated in Fig. 4b; refer to the "Methods" section for details).
The ADP metric detected effects from shift-induced overconfidence, with an inter-model percent decrease that was two orders of magnitude larger than F1-AUC (Fig. 4c). The percent decrease between the top and bottom ranking models was 53.68%. The top-ranking model was Bilipschitz with an ADP of 4.28%, and the bottom ranking model was Resnet with ADP of 9.24% (Fig. 4c). This highlights that ADP may be relevant when evaluating the performance of models that are deployed in production by encapsulating shift-induced overconfidence, which is inevitable in an oncological setting.
To further illustrate the utility of ADP, we performed an additional experiment (Supplementary Fig. S11). We used an independent classification task, the well-known CIFAR-10 (IID) dataset and its OOD variant, CIFAR-10-C, and compared a non-Bayesian CNN Resnet model and a Deep Kernel Learning Model (i.e., neural Gaussian process). The results were in line with our hypothesis that Bayesian deep learning improves robustness to distribution shift, demonstrated by a lower ADP for the Gaussian process model compared to the Resnet model.

Discussion
A major barrier to using DL in clinical practice is the shift-induced overconfidence encountered when deploying a DL model from development to production. Reducing and accounting for shift-induced overconfidence with appropriate models and relevant metrics should make the models more transparent and trustworthy for translation into practice. Our work herein shows that marked progress can be made with simple Bayesian DL models deployed in conjunction with uncertainty thresholding. However, the performance of models deployed in production can be difficult to evaluate without a suitable metric, therefore we developed ADP to directly measure shift-induced overconfidence.
Three Bayesian models with canonical extensions, namely MCD, Bilipschitz, Ensemble, were chosen to test whether simple modifications applicable to any DL architecture can improve performance in production. The Bayesian models were selected according to criteria for which we believe would facilitate adoption: (1) simplicity, for wider accessibility; (2) ubiquity, to ensure models were accepted and tested methods; (3) already demonstrated as robust to shift-induced overconfidence 22,48,49 ; and (4) computational scalability. Our prior expectations were that each canonical extension would further improve generalisation of both accuracy and uncertainty quality, albeit at the cost of increased complexity. Those expectations were mostly in line with our benchmarking results, since the most complex model (Ensemble) went from worst-performing in IID to best-performing model in OOD in terms of accuracy. Furthermore, while inspection into overconfidence presented no significant inter-model differences within IID data, the OOD overconfidence differences were significant, whereby added complexity corresponded to less shift-induced overconfidence. Using the ADP statistic, improvements in robustness to shift-induced overconfidence were shown to have a large impact on the accuracy in production when rejecting unreliable predictions above an acceptable uncertainty threshold. Hence, any DL architecture's accuracy in production can be substantially improved with simple and scalable approximate Bayesian modifications. This phenomenon is sometimes referred to as "turning the Bayesian crank" 50 .
We restricted our uncertainty statistics to predictive (i.e., total) uncertainties, since it was not possible to estimate the sub-divisions of uncertainty with the baseline Resnet model, which only captures uncertainty about the data. The Bayesian models captured an additional component of uncertainty, the 'epistemic' uncertainty, hence they all had larger total uncertainty estimates when compared to the non-Bayesian baseline. Consequently, the Bayesian models filled the uncertainty gap caused by distribution shift (i.e., shift-induced overconfidence). In future work, a richer picture may be understood by focusing only on distribution-wise models to inspect the two sub-divisions of the predictive uncertainty: epistemic (model) uncertainty and aleatoric (inherent) uncertainty. Epistemic uncertainty is dependent on the model specification and may be reduced with more data or informative priors. Aleatoric uncertainty is dependent on data's inherent noise and can be reduced with more data features that explain variance caused by confounding variables (e.g., patient age, cancer stage, batch effect). Epistemic and aleatoric uncertainties present the potential for further insights, including whether a data point's predictive uncertainty will reduce with either more examples or by an altered model design (epistemic uncertainty), or more features (aleatoric uncertainty) [51][52][53][54] .
This study addressed distributional shift effects on uncertainties with parametric models, which assume parameters are sufficient to represent all training data. Non-parametric models relax that assumption, which is arguably crucial to detect when data are outside the domain of training data ('out-of-domain') and for avoiding extreme overconfidence, i.e., 'silent catastrophic failure' . In future work, non-parametric models, for example Gaussian Processes, capable of measuring uncertainties about 'out-of-domain' data, should also be explored [44][45][46]55 .
Our work suggests that considerations of robustness to distributional shifts must encapsulate uncertainty and prediction to improve performance in production. While this study focused on the quality of uncertainty, it is important to note that other DL components are worth consideration too. These include model architecture (i.e. inductive bias), which can be tailored to ignore redundant data-specific aspects of a problem via invariant or  57 , and/or structural causal models 58-60. Such tailored models can further improve data efficiency 56, robustness to distributional shifts 27, and are central to an appropriate model specification that challenges DL deployment 61. The importance of tailored inductive biases is supported by the prolific advances in fields beyond clinical diagnostics, in computer vision (e.g. CNN's translational equivariance 56) and biology (e.g. how AlphaFold 2 62 solved the Critical Assessment of protein Structure Prediction, CASP 63). These studies show that a wide array of DL components can improve generalisation and, thus, DL performance in production. Our study argues that uncertainty calibration is an important element in that array; hence, improving the quality of uncertainty can lead to improved DL reliability in production.
In practice, we hope the community considers utilising uncertainty thresholding as a proactive method to improve the accuracy and safety of DL applications deployed in the clinic. This may involve (iterative) consultation between ML engineers and medical professionals to agree on a 'minimally acceptable accuracy' for production (denote this $\min(F1_{dev})$). The ML engineer may then use development data to train an approximate Bayesian DL model and produce Development F1-Uncertainty curves (with validation data). The engineer can then, with another independent dataset, develop an ADP estimate (as described in the "Methods" section) to help communicate (in the context of available dataset differences) what the expected accuracy decrease may be when the model is deployed to production, which helps manage expectations and facilitate trust. Importantly, with the (prototypical) ADP, the team may better judge which uncertainty quantification techniques are most effective for boosting accuracy under the 'uncertainty thresholding' risk-management regime. This procedure, as well as the ADP statistic, is of course prototypical and only suggestive. We leave its improvement and clarification for future work.
In conclusion, our study highlighted approaches for quantifying and improving robustness to shift-induced overconfidence with simple and accessible DL methods in the context of oncology. We justified our approach with mathematical and empirical evidence, biological interpretation, and a new metric, the ADP, designed to encapsulate shift-induced overconfidence, a crucial aspect that needs to be considered when deploying DL in real-world production. Moreover, the ADP is directly interpretable as a proxy for the expected accuracy loss when deploying DL models from development to production. Although we have addressed the shift-induced overconfidence by utilising first-line solutions, work remains to bridge DL from theory to practice. We must account for data distributions, evaluation metrics, and modelling assumptions, as all are equally important and necessary considerations for the safe translation of DL into clinical practice.

Methods
Prediction task and datasets. The task was to predict a patient's primary cancer type, which we cast under the supervised learning framework by learning the map $x \to y$, with $y$ denoting the primary cancer category, and $x \in \mathbb{R}^D$ denoting a patient's sampled bulk gene expression signature.
since we believed it to be approximately independent and identically distributed (IID) data. All other strata were assumed out-of-distribution (OOD) due to distribution shifts caused by confounding variables. As a result, the training and validation data were IID, while the test data were OOD.
Benchmarked models. Four models were benchmarked in this study: the baseline pointwise Resnet, MCD, Bilipschitz, and Ensemble. All models shared identical model architecture and hyperparameter settings (including early stopping), respectively controlling the inductive bias and accuracy from confounding overconfidence. Although we did not perform explicit hyperparameter optimisation, some manual intervention was used to adjust hyperparameters within the validation set. For example, the singular value bound hyperparameter (for spectral normalisation) was manually tuned to be as low as practically possible, while remaining flexible enough to learn the training task of predicting the primary site.
$U^{(0)}$. Hidden layers have residual connections $U^{(l)} = g\left(U^{(l-1)}, W^{(l)}\right) + b^{(l)} + U^{(l-1)}$, where $l \in \{1, 2, \ldots, L\}$ denotes the hidden layer index ($L = 3$ in this case). The final output layer is a pointwise (mean estimate) function in logit-space, $f(X) = g\left(U^{(L)}, W^{(\mu)}\right) + b^{(\mu)}$, where $W^{(\mu)}, b^{(\mu)}$ are the final output (affine) transformation parameters. Finally, SoftMax normalisation yields a K-vector $p(X) = \mathrm{SoftMax}(f(X))$. All other hyperparameter settings are defined in Supplementary Table S2. This baseline Resnet model architecture was inherited by all other models in this study to control inductive biases.

Approximate Bayesian inference. Bayesian inference may yield a predictive distribution about sample $x^*$, $p(p \mid x^*, D)$, from the likelihood of an assumed parametric model $p(p \mid x^*, \Theta)$, an (approximate) parametric posterior $q(\Theta \mid D)$, and potentially a Monte Carlo Integration (MCI) technique, also referred to as Bayesian model averaging. Most neural networks are parametric models, which assume $\Theta$ can perfectly represent $D$. As a result, the model likelihood $p(p \mid x^*, D, \Theta)$ is often replaced with $p(p \mid x^*, \Theta)$. The main differentiating factor among all Bayesian deep learning inference methods lies in how the parametric posterior $q(\Theta \mid D)$ is approximated.
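For orientation, Bayesian model averaging with Monte Carlo integration is commonly written in the standard form below. This is the textbook expression in this section's notation and not necessarily the exact equation of the article; the number of Monte Carlo samples $T$ and the $t$-th posterior sample $\Theta_t$ are symbols assumed here.

```latex
p(p \mid x^*, D)
  = \int p(p \mid x^*, \Theta)\, q(\Theta \mid D)\, d\Theta
  \approx \frac{1}{T} \sum_{t=1}^{T} p(p \mid x^*, \Theta_t),
  \qquad \Theta_t \sim q(\Theta \mid D)
```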
Resnet extended with Monte Carlo Dropout. The MCD model approximates the parametric posterior $q(\Theta \mid D)$ by keeping dropout activated during inference 43. Dropout randomly 'switches off' a subset of neurons (setting them to zero-vectors) at each iteration. Hence, a collection of dropout configurations $\{\Theta_t\}_{t=1}^{T}$ are samples from the (approximate) posterior $q(\Theta \mid D)$. For more information, refer to the Appendix of 43, where an approximate dual connection between Monte Carlo Dropout neural networks and Deep Gaussian processes is established.
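The idea of keeping dropout active at inference can be illustrated with a minimal NumPy sketch. It assumes a toy two-layer network with random weights, not the Resnet used in this study; the function name `mc_dropout_predict`, the layer sizes, `drop_rate` and `T` are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise SoftMax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, W1, b1, W2, b2, T=100, drop_rate=0.5, seed=0):
    """Average T stochastic forward passes with dropout left ON (toy network)."""
    rng = np.random.default_rng(seed)
    samples = np.empty((T, x.shape[0], W2.shape[1]))
    for t in range(T):
        h = np.maximum(x @ W1 + b1, 0.0)           # hidden ReLU layer
        keep = rng.random(h.shape) >= drop_rate    # one dropout configuration
        h = h * keep / (1.0 - drop_rate)           # inverted-dropout rescaling
        samples[t] = softmax(h @ W2 + b2)          # one sample of p_t(x)
    return samples.mean(axis=0), samples           # Monte Carlo average, raw samples

# Hypothetical data: 4 samples, 8 features, 3 classes.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)
p_mean, p_samples = mc_dropout_predict(x, W1, b1, W2, b2, T=50)
```

Each pass uses a different dropout mask, so the spread of `p_samples` across passes reflects the model (epistemic) component that a single deterministic pass cannot express.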
The MCD also extends the Resnet model architecture by including an additional output layer to estimate a data-dependent variance function $s^2$. Both final output layers had a shared input $U^{(L)}$, but unique parameters. Together, the stochastic mean $f_t(X)$ and variance $s_t^2(X)$ specify a Gaussian distribution in the logit-space, which was then sampled once, $u_t(X) \sim \mathcal{N}(\mu = f_t(X), \Sigma = s_t^2(X)^T I)$, and normalised with the SoftMax function, $p_t(X) = \mathrm{SoftMax}(u_t(X))$. $p_t(X)$ represents a single sample from the model likelihood $p(p \mid x, \Theta)$, from which $T$ samples are averaged for Monte Carlo integration:
$$p(X) \approx \frac{1}{T} \sum_{t=1}^{T} p_t(X).$$
MCD extended with a bi-Lipschitz constraint. The Bilipschitz model constrains $f$ to satisfy
$$L_1 \lVert x_1 - x_2 \rVert_{\mathcal{X}} \le \lVert f(x_1) - f(x_2) \rVert_{\mathcal{F}} \le L_2 \lVert x_1 - x_2 \rVert_{\mathcal{X}},$$
where scalars $L_1$ and $L_2$ respectively control the tightness of the lower- and upper-bound. Norm operators $\lVert \cdot \rVert_{\mathcal{X}}$, $\lVert \cdot \rVert_{\mathcal{F}}$ are over the data space $\mathcal{X}$ and function space $\mathcal{F}$. The effect of the bi-Lipschitz constraint is such that the changes in input data $\lVert x_1 - x_2 \rVert_{\mathcal{X}}$ (e.g. distribution shifts) are proportional to the changes in the output, $\lVert f(x_1) - f(x_2) \rVert_{\mathcal{F}}$. These changes are within a bound determined by $L_1$ (controlling sensitivity) and $L_2$ (controlling smoothness). Interestingly, recent studies have established that bi-Lipschitz constraints are beneficial to the robustness of the neural network under distributional shifts 44,45. Sensitivity (i.e. $L_1$) is controlled with residual connections 66,67, which allows $f(x)$ to avoid arbitrarily small changes, especially in the presence of distributional shifts in those regions of $\mathcal{X}$ with no (training data) support 44. Smoothness (i.e. $L_2$) is controlled with spectral normalisation on parameters 44,68 and batch-normalisation functions 45, which allow $f(x)$ to avoid arbitrarily large changes (under shifts) that induce feature collapse and extreme overconfidence 44-46.
Deep ensemble of Bilipschitz models. The Ensemble model was a collection of eight independently trained Bilipschitz models with unique initial parameter configurations.
Each Bayesian model in the Ensemble model is sampled T/10(= 25) times and then pooled to control for Monte Carlo integration between the 'Ensemble' and all other models.
Models in deep ensembles yield similarly performant (low-loss) solutions, but are diverse and distant in parameter- and function-space 69. This allows the ensemble to have an (approximate) posterior $q(\Theta \mid D)$ with multiple modes, which was not the case for the Resnet, MCD, and Bilipschitz models. We believe the ensemble modelled $q(\Theta \mid D)$ with the highest fidelity to the true parametric posterior $p(\Theta \mid D)$ due to empirical evidence from other studies' results 27,48,70,71.
Model efficacy assessment. Model efficacy was assessed using several metrics with practical relevance in mind (justification provided in the Supplementary Information S1.2). Predictive performance, the predictive uncertainties and the total overconfidence were, respectively, measured with the micro-F1 score, Shannon's entropy H and Expected Calibration Error (ECE). F1-AUC was used to evaluate the robustness of the predictive performance and the uncertainty's error-rate correlation. The Area between Development and Production (ADP) metric was designed to complement F1-AUC by evaluating robustness to shift-induced overconfidence. This may be interpreted as the expected predictive loss during a model's transition from development inference (IID) to production inference (OOD) while controlling for the uncertainty threshold.
Quantifying predictive uncertainty. A predictive uncertainty (or total uncertainty) indicates the likelihood of an erroneous inference $p(x) = \mathrm{SoftMax}(f(x))$, with a probability vector $p(x) \in [0, 1]^K$, normalising operator $\mathrm{SoftMax}(\cdot)$, pointwise function in logit-space $f(\cdot)$, and a gene expression vector $x \in \mathbb{R}^D$. The ideal predictive uncertainties depend on the combination of many factors, including the training data $D$, model specification (e.g. model architecture, hyperparameters, etc.), inherent noise in data, model parameters $\Theta$, test data inputs $x \in D_{test}$ (if modelling heteroscedastic noise), and hidden confounding variables causing distribution shifts. Consequently, there are many statistics, each explaining different phenomena, which make up the predictive uncertainty. Given that some sub-divisions of uncertainty are exclusive to distribution-wise predictive models 72, we restricted ourselves to uncertainties that are accessible to both pointwise and distribution-wise models, namely, the confidence score, $\mathrm{Conf}(x)$, and Shannon's Entropy, $H(p(x))$.
A model's confidence score with reference to sample $x$ is defined by the largest element from the SoftMax vector,
$$\mathrm{Conf}(x) = \lVert p(x) \rVert_{\infty} = \max_{k} p_k(x),$$
where $\lVert p(x) \rVert_{\infty}$ denotes the infinity norm of the vector $p(x)$. Confidence scores approximately quantify the probability of being correct and thus they are often used for rejecting 'untrustworthy' predictions (recall 'uncertainty thresholding' from the Introduction). Moreover, an average $\mathrm{Conf}(x)$ is comparable to the accuracy metric, which allows for evaluating the overconfidence via ECE, which we will shortly detail.
Another notion of predictive uncertainty is that of Shannon's Entropy, i.e.,
$$H(p) = -\langle p, \log p \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the dot product operator. Recall that $H(p)$ is maximised when $p$ encodes a uniform distribution.
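Both uncertainty statistics can be computed directly from a SoftMax vector. The sketch below is illustrative only: the function names and the toy logits are hypothetical, and the natural logarithm is an assumption, since the base used for the reported entropies is not stated in this section.

```python
import numpy as np

def softmax(logits):
    """Row-wise SoftMax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence(p):
    """Confidence score: the largest SoftMax element per sample."""
    return p.max(axis=-1)

def shannon_entropy(p, eps=1e-12):
    """Shannon's entropy as minus the dot product of p and log p (nats, assumed)."""
    return -(p * np.log(p + eps)).sum(axis=-1)

# Toy logits for K = 3 classes (hypothetical values).
logits = np.array([[4.0, 0.0, 0.0],    # one dominant class
                   [0.0, 0.0, 0.0]])   # uniform prediction
p = softmax(logits)
conf = confidence(p)
ent = shannon_entropy(p)
```

The uniform row attains the minimum possible confidence (1/K) and the maximum entropy (log K), which is the behaviour the text describes.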
Defining out-of-distribution data and the DL effects. The IID assumption on data implies true causal mechanisms (i.e. structural causal model) where the underlying data generating process is immutable across observations, and hence the samples are independently generated from the same distribution 58.
The expected calibration error (ECE) over $M$ confidence bins is
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{B_m}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$
where $B_m$ is the number of predictions in bin $m$, $n$ is the total number of samples, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and confidence scores of bin $m$, respectively.
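A minimal sketch of a binned calibration error built from the per-bin counts, accuracies and confidences described above. It assumes equal-width confidence bins on [0, 1] and the commonly used weighted absolute-gap form; the bin count and the function name are hypothetical and may differ from the settings used in the study.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted mean absolute gap between per-bin accuracy and confidence."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(conf)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per prediction; bins are right-closed, i.e. (lo, hi].
    idx = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for m in range(n_bins):
        in_bin = idx == m
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.sum() / n * gap
    return float(ece)

# Overconfident toy case: 90% confidence but only 50% accuracy.
ece_over = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
# Calibrated toy case: full confidence, all correct.
ece_ok = expected_calibration_error([1.0, 1.0], [1, 1])
```

In the overconfident case the single occupied bin has confidence 0.9 and accuracy 0.5, so the error equals the gap of 0.4.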

Evaluation in OOD using the area under the F1-retention curve (F1-AUC).
Area under the F1-Retention Curve (F1-AUC) was used to evaluate model performance in OOD, as it accounts for both predictive accuracy and an uncertainty's error-rate correlation 47 . High F1-AUC values result from high accuracy (reflected by vertical shifts in F1-Retention curves) and/or high uncertainty error-rate correlation (reflected by the gradient of the F1-Retention curves). An uncertainty's error-rate correlation is important in the production (OOD) context as higher correlations imply more discarded erroneous predictions. F1-AUC was quantified according to the following method.
1. Predictions were sorted in descending order of uncertainty.
2. All predictions were iterated over in order once; at each iteration, F1 and retention (initially 100%) were calculated before the current prediction was replaced with ground truth, hence decreasing the retention.
3. The increasing F1 scores and the corresponding decreasing retention rates determined the F1-Retention curve.
4. Approximate integration of the F1-Retention curve determined F1-AUC.
F1-Retention curves and F1-AUC metrics were quantified for all models on OOD data, including samples with classes that were not seen during training.
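The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: macro-averaged F1, the trapezoidal rule for the approximate integration, and the extra end point at 0% retention are all assumptions, and every function name is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over all labels seen in y_true or y_pred (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def f1_retention_curve(y_true, y_pred, uncertainty):
    """Steps 1-3: sort by descending uncertainty, then replace predictions one by one."""
    n = len(y_true)
    order = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    current = list(y_pred)
    retention, f1 = [], []
    for step, i in enumerate(order):
        retention.append(1.0 - step / n)       # starts at 100%
        f1.append(macro_f1(y_true, current))
        current[i] = y_true[i]                 # replace with ground truth
    # Assumed end point: 0% retention, every prediction replaced.
    retention.append(0.0)
    f1.append(macro_f1(y_true, current))
    return retention, f1

def f1_auc(retention, f1):
    """Step 4: trapezoidal integration of F1 over retention (assumed rule)."""
    area = 0.0
    for k in range(1, len(retention)):
        width = retention[k - 1] - retention[k]
        area += width * (f1[k - 1] + f1[k]) / 2.0
    return area

# Illustrative values only: one error, flagged by the highest uncertainty.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
ret_good, f1_good = f1_retention_curve(y_true, y_pred, [0.1, 0.9, 0.2, 0.3])
# Same predictions, but the error now carries the lowest uncertainty.
ret_bad, f1_bad = f1_retention_curve(y_true, y_pred, [0.9, 0.1, 0.2, 0.3])
print(f1_auc(ret_good, f1_good), f1_auc(ret_bad, f1_bad))
```

With identical predictions, the ordering in which the error carries the highest uncertainty yields the larger area, which is the error-rate correlation effect the metric is meant to reward.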
Using ADP for evaluating models in OOD data relative to IID data. The Area between the Development and Production Curve (ADP) aimed to complement F1-AUC, especially in the context of deploying models from development inference (IID) to production inference (OOD). Thus, ADP was designed to capture (in OOD data, relative to IID) three aspects of a model's robustness, relating to accuracy, uncertainty's error-rate correlation, and shift-induced overconfidence. This is because benchmarked models can lose similar amounts of accuracy and uncertainty's error-rate correlation (as measured by F1-AUC), yet differ significantly in their uncertainty calibration (as measured by ADP).
ADP was calculated according to the following method:
1. Development and Production F1-Uncertainty curves were produced by iteratively calculating F1 and discarding (not replacing) samples in descending order of uncertainty.
2. A nominal F1 target range of $[\min(F1_{dev}), \max(F1_{dev})] = [0.975, 0.990]$ was selected, based on the Development F1-Uncertainty curve, with $(F1_{dev}, U_{accept})$ denoting a point on the Development F1-Uncertainty curve at uncertainty threshold $U_{accept}$.
3. Nominal F1 target points, $F1_{nom}$, were incremented at 1e-5 intervals from $F1_{nom} = \min(F1_{dev})$ to $F1_{nom} = \max(F1_{dev})$, with the per cent decrease in F1, from development $F1_{nom}$ to production $F1_{prod}$, recalculated at each step:
$$\mathrm{Decrease}_{(dev \to prod)}(F1_{nom}) = \frac{F1_{nom} - F1_{prod}}{F1_{nom}} \times 100\%.$$
4. The set of recalculated $\mathrm{Decrease}_{(dev \to prod)}(F1_{nom})$ values was averaged to approximate the Area between the Development and Production curves (ADP).
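The procedure above can be sketched in plain Python as follows. This is a hedged illustration, not the authors' code. Several choices are assumptions: macro-averaged F1; treating predictions with uncertainty at or below the threshold as accepted; and taking U_accept as the most permissive development threshold that still reaches the nominal target. All function names are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over all labels seen in y_true or y_pred (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def f1_at_threshold(y_true, y_pred, uncertainty, u_accept):
    """F1 on predictions accepted at u_accept (None if nothing is accepted)."""
    kept = [i for i, u in enumerate(uncertainty) if u <= u_accept]
    if not kept:
        return None
    return macro_f1([y_true[i] for i in kept], [y_pred[i] for i in kept])

def adp(dev, prod, f1_min=0.975, f1_max=0.990, step=1e-5):
    """Average per cent decrease in F1 from development to production.

    dev and prod are (y_true, y_pred, uncertainty) triples.
    """
    # Step 1: development F1-Uncertainty curve, one point per distinct threshold.
    dev_curve = [(u, f1_at_threshold(*dev, u)) for u in sorted(set(dev[2]))]
    n_steps = int(round((f1_max - f1_min) / step))
    cache, decreases = {}, []
    for k in range(n_steps + 1):                      # Step 3: sweep F1_nom
        f1_nom = f1_min + k * step
        reaching = [u for u, f1 in dev_curve if f1 >= f1_nom]
        if not reaching:
            continue                                  # target unreachable in development
        u_accept = max(reaching)                      # assumed selection rule
        if u_accept not in cache:
            cache[u_accept] = f1_at_threshold(*prod, u_accept)
        f1_prod = cache[u_accept]
        if f1_prod is None:
            continue
        decreases.append(100.0 * (f1_nom - f1_prod) / f1_nom)
    # Step 4: average the per cent decreases.
    return sum(decreases) / len(decreases) if decreases else float("nan")

# Illustrative toy data only; a coarse target range is used because the
# toy development curve never reaches the 0.975-0.990 range.
dev = ([0, 0, 1, 1, 0, 1], [0, 0, 1, 1, 1, 0], [0.1, 0.1, 0.2, 0.2, 0.8, 0.9])
shifted = ([0, 0, 1, 1, 0, 1], [1, 0, 1, 0, 1, 0], [0.1, 0.1, 0.2, 0.2, 0.8, 0.9])
adp_same = adp(dev, dev, f1_min=0.7, f1_max=0.9, step=0.1)
adp_shift = adp(dev, shifted, f1_min=0.7, f1_max=0.9, step=0.1)
print(adp_same, adp_shift)
```

In this toy example the shifted set contains errors made at low uncertainty, so thresholding fails to remove them and the average decrease is large; evaluating the development set against itself gives a non-positive value because the accepted predictions meet or exceed each nominal target.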
The ADP may be interpreted as "the expected decrease in accuracy when transitioning from development to production if uncertainty thresholding is utilised to boost reliability".
It is important to note that our method for selecting the range $[\min(F1_{dev}), \max(F1_{dev})]$ was not arbitrary and required two checks on each model's Development F1-Uncertainty curve. The first check ensured that the sample size corresponding to $\max(F1_{dev})$ was sufficiently large (see Supplementary Table S3). The second check ensured that $\min(F1_{dev})$ was large enough to satisfy production needs. Failing to undertake these checks may cause the ADP statistic to misrepresent the expected loss when deploying models to production. ADP is practically relevant because it relates directly to the uncertainty thresholding technique for improving reliability in production (recall the Introduction): $\mathrm{Decrease}_{(dev \to prod)}(F1_{nom})$ first depends on a nominated target performance $F1_{nom}$, which selects the corresponding $U_{accept}$ from the Development F1-Uncertainty curve. Predictions with uncertainties below $U_{accept}$ are then accepted in production, with performance denoted by $F1_{prod}$. As far as the authors are aware, no other metric monitors the three robustness components of accuracy, uncertainty's error-rate correlation, and shift-induced overconfidence.
Ethics approval and consent to participate. This project used RNA-seq data which was previously published or is in the process of publication. The QIMR Berghofer Human Research Ethics Committee approved use of public data (P2095).